Here we will map scRNA-seq data from B-ALL patients onto an atlas of human B cell development. First, we will map the cells onto a complete hematopoietic reference (BoneMarrowMap) and next we will map cells along the B-lymphoid trajectory onto a more detailed reference of human B cell development derived from fetal and post-natal samples.
Since there are two rounds of classification, this complete tutorial will take approximately 20 minutes to run.
library(Seurat)
library(tidyverse)
library(symphony)
library(ggpubr)
library(patchwork)
Install package from github
## install dependencies that are not on CRAN
if(!require(BiocManager, quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("AUCell", "doMC"))
if(!require(devtools, quietly = TRUE)) install.packages("devtools")
devtools::install_github("jaredhuling/jcolors")
## install BoneMarrowMap package
devtools::install_github('andygxzeng/BoneMarrowMap', force = TRUE)
Downloading GitHub repo andygxzeng/BoneMarrowMap@HEAD
For our example data, we will use single cell transcriptomes from two B-ALL patients with BCR::ABL1 driven disease. These two patients represent distinct BCR::ABL1 subtypes - Multipotent and Committed.
curl::curl_download('https://bdevelopmentmap.s3.us-east-2.amazonaws.com/ExampleQuery_BALL_scRNAseq.rds',
destfile = paste0(projection_path, 'ExampleQuery_BALL_scRNAseq.rds'))
query <- readRDS('ExampleQuery_BALL_scRNAseq.rds')
query
An object of class Seurat
36601 features across 13265 samples within 1 assay
Active assay: RNA (36601 features, 0 variable features)
We will first map our single-cell transcriptomes onto BoneMarrowMap. This will provide a classification for all cell types within the sample. After this, we will subset the cells residing along the B-lymphoid trajectory and project them onto our focused reference map of human B cell development.
Let’s start by loading the reference atlas of human hematopoiesis - BoneMarrowMap
# Set directory to store projection reference files
projection_path = './'
# Download Bone Marrow Reference - 344 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrowMap_SymphonyReference.rds',
destfile = paste0(projection_path, 'BoneMarrowMap_SymphonyReference.rds'))
# Download uwot model file - 221 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrowMap_uwot_model.uwot',
destfile = paste0(projection_path, 'BoneMarrowMap_uwot_model.uwot'))
# Load Symphony reference
BM_ref <- readRDS(paste0(projection_path, 'BoneMarrowMap_SymphonyReference.rds'))
# Set uwot path for UMAP projection
BM_ref$save_uwot_path <- paste0(projection_path, 'BoneMarrowMap_uwot_model.uwot')
If we want to visualize celltype labels or metadata from the BM Reference, we can create a Seurat Object from the symphony reference This will be memory efficient as it will not include gene expression counts, only the UMAP coordinates and the metadata including cell labels and sorting information
BM_ref_obj <- create_ReferenceObject(BM_ref)
DimPlot(BM_ref_obj, reduction = 'umap', group.by = 'CellType_Annotation_formatted', raster=FALSE, label=TRUE, label.size = 4)
We can also visualize broader cell type labels which may simplify interpretation and downstream analysis.
DimPlot(BM_ref_obj, reduction = 'umap', group.by = 'CellType_Broad',
raster=FALSE, label=TRUE, label.size = 4)
We can visualize other annotations too, including cell cycle phase and lineage pseudotime estimates.
p1 <- DimPlot(BM_ref_obj, reduction = 'umap', group.by = 'CyclePhase', raster=FALSE)
p2 <- FeaturePlot(BM_ref_obj, reduction = 'umap', features = 'Pseudotime', raster=FALSE)
p1 + p2
Now we can map our B-ALL samples onto the normal hematopoietic hierarchy.
Provide raw counts, metadata, and donor key. This should take <1 min. Calculate mapping error and perform QC to remove low quality cells with high mapping error
# batch variable to correct in the query data, set as NULL if no batches in query
batchvar <- 'Sample'
# Map query dataset using Symphony (Kang et al 2021)
query <- map_Query(
exp_query = query@assays$RNA@counts,
metadata_query = query@meta.data,
ref_obj = BM_ref,
vars = batchvar
)
Normalizing
Scaling and synchronizing query gene expression
Found 2360 reference variable genes in query dataset
Project query cells using reference gene loadings
Clustering query cells to reference centroids
Correcting query batch effects
UMAP
All done!
Warning: Adding features not currently present in the object
In leukemia samples, the distribution of mapping error scores can vary broadly from sample to sample. In this context, we will want to threshold outliers with high mapping error on a per-sample basis. Typically, a threshold of 2, 2.5, or 3 MADs works well.
In some cases where sequencing depth is very low (e.g. older datasets from first-generation scRNA-seq protocols), a more stringent threshold of even 1.5 may be warranted to eliminate cells with low mapping quality
# Run QC based on mapping error score, flag cells with mapping error >= 2.5 MADs above median
query <- query %>% calculate_MappingError(., reference = BM_ref, MAD_threshold = 2.5,
threshold_by_donor = TRUE, donor_key = batchvar) # threshold mapping error on a per-sample basis.
# Plot distribution by patient to ensure you are catching the tail
query@meta.data %>%
ggplot(aes(x = mapping_error_score, fill = mapping_error_QC)) +
geom_histogram(bins = 200) + facet_wrap(.~get(batchvar))
# Get QC Plots
QC_plots <- plot_MappingErrorQC(query)
# Plot together - If this is too crowded, can also just call "QC_plots" aloneto display one by one
patchwork::wrap_plots(QC_plots, ncol = 4, widths = c(0.8, 0.3, 0.8, 0.3))
This important step identifies a subset of cells with high mapping error from the query dataset that are either:
Sometimes, low quality cells may erroneously map to the orthochromatic erythroblast region as this cell type has very low transcriptional diversity. These low quality query cells do not have hemoglobin expression and are in fact mis-mapped; they will be flagged by the QC filter and excluded from cell type assignments.
Please adjust the MAD_threshold (typically between 1 and 3) based on the distribution of your dataset to identify the outliers with low quality and high mapping error scores. This will improve your classifications and any downstream composition analysis
# # Optional step - remove outliers with high mapping error
# query <- subset(query, mapping_error_QC == 'Pass')
Optionally, outlier cells with high mapping error can also be removed at this stage. For ease of integrating these mapped annotations with the rest of your analysis, we can choose to skip this step. If so, Final CellType and Pseudotime predictions will be assigned as NA for cells failing the mapping error QC threshold.
We will next use a KNN classifier to assign cell identity based on the 30 K-Nearest Neighbours from the reference map. Broader cell type labels will also be transferred automatically along with the precise cell type labels. This label transfer step will take longer, potentially around 10 minutes for ~10,000 cells
# Predict Hematopoietic Cell Types by KNN classification
query <- predict_CellTypes(
query_obj = query,
ref_obj = BM_ref,
initial_label = 'initial_CellType_BoneMarrowMap', # celltype assignments before filtering on mapping QC
final_label = 'predicted_CellType_BoneMarrowMap' # celltype assignments with map QC failing cells assigned as NA
)
DimPlot(subset(query, mapping_error_QC == 'Pass'), reduction = 'umap_projected', group.by = c('predicted_CellType_BoneMarrowMap'),
raster=FALSE, label=TRUE, label.size = 4)
We can also visualize the broader cell type categories in case precise labels are too granular. These provide a simpler view of the data and can help guide cluster annotations if your dataset does not have many cells.
DimPlot(subset(query, mapping_error_QC == 'Pass'), reduction = 'umap_projected',
group.by = c('predicted_CellType_BoneMarrowMap_Broad'), raster=FALSE, label=TRUE, label.size = 4)
We can also annotate each query cell based on their position along hematopoietic pseudotime. Query cells will be assigned a pseudotime score based on the 30 K-Nearest Neighbours from the reference map. Since our Pseudotime KNN assignments are performed in UMAP space (more accurate than KNN on harmony components), this step is very fast (< 10s)
library(RColorBrewer)
Warning: package ‘RColorBrewer’ was built under R version 4.1.2
# Predict Pseudotime values by KNN
query <- predict_Pseudotime(
query_obj = query,
ref_obj = BM_ref,
initial_label = 'initial_Pseudotime', # pseudotime assignments before filtering on mapping QC
final_label = 'predicted_Pseudotime' # pseudotime assignments with map QC failing cells assigned as NA
)
# Visualize Hematopoietic Pseudotime in query data
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), features = c('predicted_Pseudotime'), split.by = 'Sample') &
scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Now let’s visualize the density distribution of query cells across the hematopoietic hierarchy
# Set batch/condition to be visualized individually
batch_key <- 'Sample'
# returns a list of plots for each donor from a pre-specified batch variable
projection_plots <- plot_Projection_byDonor(
query_obj = query,
batch_key = batch_key,
ref_obj = BM_ref,
Hierarchy_only = FALSE, # Whether to exclude T/NK/Plasma/Stromal cells
downsample_reference = TRUE,
downsample_frac = 0.25, # down-sample reference cells to 25%; reduces figure file size
query_point_size = 0.2, # adjust size of query cells based on # of cells
saveplot = TRUE,
save_folder = 'projectionFigures/'
)
# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots, ncol = 2)
We can also set Hierarchy_only = TRUE to remove T/NK/Plasma/Stromal cells and focus solely on the hematopoietic hierarchy.
# Set batch/condition to be visualized individually
batch_key <- 'Sample'
# returns a list of plots for each donor from a pre-specified batch variable
projection_plots <- plot_Projection_byDonor(
query_obj = query,
batch_key = batch_key,
ref_obj = BM_ref,
Hierarchy_only = TRUE, # Whether to exclude T/NK/Plasma/Stromal cells
downsample_reference = TRUE,
downsample_frac = 0.25, # down-sample reference cells to 25%; reduces figure file size
query_point_size = 0.2, # adjust size of query cells based on # of cells
saveplot = TRUE,
save_folder = 'projectionFigures/'
)
# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots, ncol = 2)
Let’s save the results of the mapping onto the complete hematopoietic hierarchy.
saveRDS(query, 'QueryData_Mapped_CompleteHematopoiesis.rds')
Having completed one round of mapping and obtaining initial classifications, we will now map cell along the B-cell differentiation trajectory to our refined map of human B cell development. This will provide us with finer classifications along the B cell differentiation trajectory.
This second reference map was constructed by pooling HSPCs and B-lymphoid progenitors from across ontogeny (fetal, cord blood, adult bone marrow) and achieves greater resolution of the fine cell states spanning human B cell development (n = 130,085 cells).
Let’s start by loading this reference atlas of human B cell development.
# Set directory to store projection reference files
projection_path = './'
# Download Bone Marrow Reference - 187 Mb
curl::curl_download('https://bdevelopmentmap.s3.us-east-2.amazonaws.com/BDevelopmentMap_SymphonyReference.rds',
destfile = paste0(projection_path, 'BDevelopmentMap_SymphonyReference.rds'))
# Download uwot model file - 99 Mb
curl::curl_download('https://bdevelopmentmap.s3.us-east-2.amazonaws.com/BDevelopmentMap_uwot_model.uwot',
destfile = paste0(projection_path, 'BDevelopmentMap_uwot_model.uwot'))
# Load Symphony reference
bdev_ref <- readRDS(paste0(projection_path, 'BDevelopmentMap_SymphonyReference.rds'))
# Set uwot path for UMAP projection
bdev_ref$save_uwot_path <- paste0(projection_path, 'BDevelopmentMap_uwot_model.uwot')
If we want to visualize celltype labels or metadata from the BM Reference, we can create a Seurat Object from the symphony reference This will be memory efficient as it will not include gene expression counts, only the UMAP coordinates and the metadata including cell labels and sorting information
bdev_ref_obj <- create_ReferenceObject(bdev_ref)
DimPlot(bdev_ref_obj, reduction = 'umap', group.by = 'BDevelopment_CellType', raster=FALSE, label=TRUE, label.size = 4)
Here we will pick up where we left off - using the mapped query object, we can must subset the cell types proximal to the B cell differentiation trajectory. Using this new subsetted object, we will map these cells onto our refined atlass of B cell differentiation.
B_development_celltypes <- c('HSC', 'MPP-MyLy', 'LMPP', 'Early GMP', 'MLP', 'MLP-II', 'Pre-pDC', 'Pre-pDC Cycling', 'pDC',
'CLP', 'Pre-ProB', 'Pro-B VDJ', 'Pro-B Cycling', 'Large Pre-B', 'Small Pre-B', 'Immature B', 'Mature B')
query_Bdev <- subset(query, predicted_CellType_BoneMarrowMap %in% B_development_celltypes)
query_Bdev
An object of class Seurat
36601 features across 11572 samples within 1 assay
Active assay: RNA (36601 features, 0 variable features)
3 dimensional reductions calculated: pca_projected, harmony_projected, umap_projected
Provide raw counts, metadata, and donor key. This should take <1 min Calculate mapping error and perform QC to remove low quality cells with high mapping error
# batch variable to correct in the query data, set as NULL if no batches in query
batchvar <- 'Sample'
# Map query dataset using Symphony (Kang et al 2021)
query_Bdev <- map_Query(
exp_query = query_Bdev@assays$RNA@counts,
metadata_query = query_Bdev@meta.data,
ref_obj = bdev_ref,
vars = batchvar
)
Normalizing
Scaling and synchronizing query gene expression
Found 950 reference variable genes in query dataset
Project query cells using reference gene loadings
Clustering query cells to reference centroids
Correcting query batch effects
UMAP
All done!
We have now mapped our cells onto the refined atlas of B-cell development. Before we refine our cell type classifications, we can visualize coarse cell type classifications from our first round of projection using BoneMarrowMap.
DimPlot(subset(query_Bdev, mapping_error_QC == 'Pass'), reduction = 'umap_projected',
group.by = c('predicted_CellType_BoneMarrowMap_Broad'), raster=FALSE, label=TRUE, label.size = 4)
In order to refine cell type assignments along B-cell development, we will use the 30 K-Nearest Neighbours from the B-cell development reference map.
# Predict Hematopoietic Cell Types by KNN classification
query_Bdev <- predict_CellTypes(
query_obj = query_Bdev,
ref_obj = bdev_ref,
ref_label = 'BDevelopment_CellType', ## for a more detailed annotation, use BDevelopment_CellType_Comprehensive
initial_label = 'initial_CellType_BDevelopment', # celltype assignments before filtering on mapping QC
final_label = 'predicted_CellType_BDevelopment' # celltype assignments with map QC failing cells assigned as NA
)
DimPlot(subset(query_Bdev, mapping_error_QC == 'Pass'), reduction = 'umap_projected',
group.by = c('predicted_CellType_BDevelopment'), raster=FALSE, label=TRUE, label.size = 4)
Now let’s visualize the density distribution of query cells across the hematopoietic hierarchy
# Set batch/condition to be visualized individually
batch_key <- 'Sample'
# returns a list of plots for each donor from a pre-specified batch variable
projection_plots_bdev <- plot_Projection_byDonor(
query_obj = query_Bdev,
batch_key = batch_key,
ref_obj = bdev_ref,
downsample_reference = TRUE,
downsample_frac = 0.25, # down-sample reference cells to 25%; reduces figure file size
query_point_size = 0.2, # adjust size of query cells based on # of cells
saveplot = TRUE,
save_folder = 'projectionFigures/'
)
# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots_bdev, ncol = 2)
saveRDS(query_Bdev, 'QueryData_Mapped_BcellDevelopment.rds')
Here, to study the abundance of each cell type within each donor, I focus on cells that were classified with a KNN prob > 0.5 (that is, >50% of nearest neighbours from the reference map agree on the assigned cell type).
We can present this as a long table
query_composition <- get_Composition(
query_obj = query_Bdev,
donor_key = 'Sample',
celltype_label = 'predicted_CellType_BDevelopment',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'long')
query_composition
Or as a wide table with counts of # of cells
query_composition <- get_Composition(
query_obj = query_Bdev,
donor_key = 'Sample',
celltype_label = 'predicted_CellType_BDevelopment',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'count')
query_composition
Or as a wide table with proportion of each cell type within each donor
query_composition <- get_Composition(
query_obj = query_Bdev,
donor_key = 'Sample',
celltype_label = 'predicted_CellType_BDevelopment',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'proportion')
query_composition
For downstream analysis, you can use the projected celltype labels to help annotate any leukemia cell clusters generated through unsupervised dimensionality reduction and clustering from individual patients. Sometimes, unsupervised analysis will yield clusters corresponding to different developmental states and other times it may yield clusters corresponding to distinct subclones within the patient. Along with tools like inferCNV for patients with known cytogenetic abnormalities, this can be integrated to visualize cellular hierarchies at the level of individual subclones.
For downstream analysis, let’s take the example of SJPHALL007_D, with involvement of multipotent early lymphoid cells along with committed B-cell precursors.
SJPHALL007D <- query_Bdev %>%
subset(., Sample == 'SJPHALL007_D') %>%
SCTransform() %>%
RunPCA() %>%
RunUMAP(reduction = 'pca', dims=1:20) %>%
FindNeighbors(reduction = 'pca', dims = 1:20) %>%
FindClusters(resolution = 0.8, algorithm = 3)
SJPHALL007D <- CellCycleScoring(SJPHALL007D, s.features = cc.genes.updated.2019$s.genes,
g2m.features = cc.genes.updated.2019$g2m.genes, set.ident = F)
DimPlot(SJPHALL007D, reduction = 'umap', group.by=c('seurat_clusters', 'Phase'), label = F)
Here we see multiple cell populations, including at least two major B-ALL blast populations. Let’s overlay the course predictions from BoneMarrowMap along with the finer predictions from the B cell development map.
DimPlot(SJPHALL007D, reduction = 'umap', group.by=c('predicted_CellType_BoneMarrowMap_Broad',
'predicted_CellType_BDevelopment'), label = T)
This reveals that one major blast population of committed Pro-B cells within the patient sample and another major blast population of early lymphoid progenitors that are contributing to both B cell development as well as plasmacytoid dendritic cell development. At the bottom right, there is a population of likely healthy immature B cells.